Abstract

In this paper, a novel test strategy, the Loop Testing Architecture (LTA), is introduced to
reduce aliasing probability and testing time for multichip modules. This is accomplished by
connecting  Cascadable  Built-In  Testers  (CBITs)  in  neighboring  pipelined  stages  to  increase  the 
length  of  the  test  suites.  Fundamental  properties  of  LTA  supporting  the  randomness  in  the 
generated test patterns (state coverage) and the asymptotic aliasing probability are presented. Our
results on two small-scale multi-processor configurations show that the aliasing probability of
LTA signature analysis is comparable to that of an MLFSR [1] at a fairly low area
overhead, and that LTA requires less testing time than the Circular Self-Test Path
technique [14].

Further  evaluation  on  the  potential  capabilities  provided  by  the  LTA,  compared  with 
boundary  scan  and  other  pipelined  test  scheduling  approaches  confirmed  that  LTA  provides  a  new 
framework  for  designing  effective  testable  systems. 
Introduction


The  next  generation  packaging,  multichip  modules  (MCMs)  which  interconnect  multiple 
bare  dies  by  means  of  a  stack  of  conductive  and  dielectric  thin  film,  offers  tremendous  advantages 
such  as  reduced  time  delays  between  chips,  less  electrical  noise  and  cross  talk,  simplified  power 
distribution, and small size. However, large I/O lead counts and high-density interconnects
decrease testing throughput and increase testing cost. Traditionally, testing is performed
hierarchically: chips are tested individually before assembly, and the assembled module is then
tested for correctness to catch any errors introduced during packaging. Methods such as bed-of-nails
fixtures and hand-held diagnostic probes become infeasible and cost-ineffective when new
technologies such as MCMs and surface-mounted devices are introduced. This is due both to the often
incomplete or unavailable test vectors from chip manufacturers and to the low observability of the
module's internals. The new approach in testing is built-in test (BIT), where a small amount of
circuitry is included in the circuit/system under test (CUT/SUT). Examples of well-known BIT techniques are scan
design  [8],  Built-in  Logic  Block  Observer  (BILBO)  [9],  etc. 

Scan design methods involve disconnecting the memory elements and/or the flip-flops
from the combinational logic. The main problem with the scan method is the overwhelming
amount  of  test  outputs  generated  by  any  relatively  large  circuit.  One  popular  data  compaction 
solution  is  signature  analysis,  which  utilizes  a  linear-feedback  shift  register  (LFSR)  to  receive  and 
modify  output  data.  The  residue  in  the  shift  register,  also  called  the  signature,  of  a  faulty  circuit 
will  differ  from  that  of  a  good  circuit  after  a  long  sequence  of  test  patterns.  Therefore,  a  combined 
boundary-scan  and  built-in  self  test  (BIST)  technique  is  recommended  for  board-level  testing  [8] 
to  test  complex  circuits  more  efficiently. 

BILBO  is  a  technique  that  combines  the  basic  features  of  scan  design  with  those  of 
signature  analysis  [9].  Feedback  paths  are  formed  in  the  shift  registers  by  XORing  some  outputs 
from the flip-flops and connecting the result back to some of the inputs of the flip-flops. The XOR
pattern is fixed for a given BILBO width and implements a primitive polynomial. Extra control ports are
added  to  the  shift  registers  where  one  combination  of  the  control  signals  configures  BILBO  into  a 
multiple-input  shift  register  (MISR)  for  compacting  circuit  responses.  The  original  8-bit  BILBO 
design  has  not  kept  pace  with  the  bandwidth  of  today’s  computer  design  whose  internal  VLSI  bus 
paths  have  long  been  extended  from  8-bit  to  16-bit  or  even  32-bit  wide.  Therefore,  it  is  essential  to 
redesign the BILBO to accommodate the wider bus paths of today's complex VLSI systems. In [10],
a family of concatenating polydividers with primitive characteristic polynomials was proposed
to resolve the unextendable BILBO problem for packaged chips.

In order to minimize hardware overhead and design time while still maintaining adequate state and
fault coverage, we propose a bytewise cascadable built-in tester (CBIT) macro cell with an optimum
primitive characteristic polynomial. The purpose is to keep the CBIT cell in a design library so
that circuit/system designers can easily construct the necessary feedback path for their BIST
circuitry. Previous work on the circular self-test path (CSTP) [14] also accomplished cascadability,
but by simply connecting the registers in a circuit to form a closed loop whose feedback
polynomial is of the form $x^n + 1$. This feedback characteristic polynomial of the CSTP is
non-primitive; therefore, the CSTP approach can be viewed as a special application of the CBIT. The
performance of CSTP suffers as a result of its feedback polynomial being non-primitive.
Specifically, a sufficiently long testing time is recommended for CSTP's aliasing probability
to approach the asymptotic value, where N is the input width of the CUTs [14].

To further improve on testing time, we propose a novel approach, referred to as the Loop
Testing Architecture (LTA), based on CBITs for testing MCMs concurrently. The LTA utilizes
CBITs in a pipe interwoven with the high I/O count chips on MCMs. Simulation results show
that this arrangement achieves high test coverage with the employment of a maximum-length
pseudo-random sequence (PRS) for test pattern generation. The aliasing probability is
comparable to that provided by a two-fold MLFSR [1] with only a fraction of the area necessary.

The need for parallel exhaustive testing is significant for MCMs. The original test vectors
for chips with high I/O counts from different manufacturers may not be available for the functional
testing of the assembled module [15]. In this case, parallel pipelined exhaustive testing using LTA
becomes imperative for MCM designers to achieve better fault coverage in less time
than boundary scan. For chips without BIST circuitry, arrays of CBITs can be provided to
the MCM in the form of a small chip on the same substrate or off-MCM test circuitry. For chips
with an existing on-chip BIST structure, LTA can easily be supported.

This article discusses the CBIT design, the construction of the LTA pipes, and the
underlying properties supporting such a design. Measurable bounds for evaluation of the LTA from
two sample systems will be presented, followed by comparisons with the existing boundary scan
and pipelined BILBO with conflict scheduling [6] [7] in terms of testing time and area overhead.

Cascadable  Built-in  Testing  Structure 

The  design  goal  of  CBIT  is  to  provide  a  macro  cell  in  the  design  library  expediting  the  BIT 
design  process.  CBIT  cells  are  cascaded  to  form  a  CBIT  suite  utilizing  multiplexors  and  XORs 
placed  in  strategic  locations  to  construct  different  feedback  paths,  thus  generating  primitive 
polynomials  in  multiple  byte  configuration.  A  CBIT  suite  with  feedback  connections  representing 
a  primitive  polynomial  acts  as  a  maximum-length  PRS  generator  [5].  CBIT  performs  not  only  test 
pattern  generation  and  signature  analysis,  but  permits  cascadability  to  generate  a  maximal  length 
pseudo-random  sequence  (PRS).  In  performing  signature  analysis,  a  primitive  characteristic 
polynomial  gives  a  quicker  convergence  of  the  smaller  asymptotic  aliasing  probability  for  a  given 
test  length  [4]. 
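
As a minimal sketch of why a primitive feedback polynomial matters, the following Python model clocks a Galois-style LFSR (no parallel inputs) and counts how many steps it takes to return to its seed. The polynomials are illustrative assumptions: the text does not state the CBIT's actual feedback polynomial, so two commonly tabulated primitive polynomials stand in for the 8-bit and 16-bit cases, and $x^8 + 1$ stands in for a plain circular register.

```python
def lfsr_period(poly: int, width: int, seed: int = 1) -> int:
    """Count the clocks until a Galois LFSR returns to its seed state.

    `poly` encodes the feedback polynomial including its x^width term,
    e.g. 0x11D for x^8 + x^4 + x^3 + x^2 + 1 (an assumed example).
    """
    state, steps = seed, 0
    while True:
        state <<= 1                  # shift: multiply the state by x
        if state >> width:           # a 1 fell out of the top flip-flop
            state ^= poly            # feed it back through the XOR taps
        steps += 1
        if state == seed:
            return steps


PRIMITIVE_8 = 0x11D     # x^8 + x^4 + x^3 + x^2 + 1 (assumed, primitive)
CIRCULAR_8 = 0x101      # x^8 + 1, a plain circular shift (non-primitive)
PRIMITIVE_16 = 0x16801  # x^16 + x^14 + x^13 + x^11 + 1 (assumed, primitive)

print(lfsr_period(PRIMITIVE_8, 8))    # 255   = 2^8 - 1
print(lfsr_period(CIRCULAR_8, 8))     # 8
print(lfsr_period(PRIMITIVE_16, 16))  # 65535 = 2^16 - 1
```

With a primitive polynomial the register walks through every non-zero state before repeating, whereas the circular register started from this seed repeats after only eight clocks.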

The  CBIT  Design 

A CBIT cell is a modified eight-bit BILBO. It has three control signals [2], $C_x$, $C_y$ and $C_z$;
eight parallel inputs (D-bus); eight parallel outputs (Q-bus); an LFSR consisting of eight flip-flops;
and XORs providing the feedback path of the LFSR. Two serial data ports, Scan_In and Scan_Out, are
used for the scan path. Finally, Feedback_In and Feedback_Out provide the cascading links among
CBITs. Fig. 1a shows the 8-bit CBIT cell and Fig. 1b is a 16-bit CBIT suite configured from two
CBIT cells [2].

The feedback pattern/generating polynomial for the CBITs is chosen so that the maximum-length
PRS will be generated in both the 8-bit and 16-bit cases. Notice that in the 16-bit case, the
feedback path for the least significant CBIT is different from that of the most significant CBIT
(Fig. 1b), since the generating polynomial for the 16-bit CBIT suite has to be prime in order to guarantee
the maximum randomness and quick convergence to the asymptotic value of the aliasing
probability [4]. In general, CBITs can be cascaded to make extended-length MISRs to fit the
increasing size of the data buses without redesigning the details of the BIST circuitry. This helps
to speed up the design modification cycle that makes the original designs more testable.

Operation  Modes  Provided  by  CBIT 

There  are  three  modes  of  operation  (Fig.  2):  parallel  register,  scan  path,  and  MISR.  The 
parallel  register  mode  is  for  normal  operation.  CBITs  form  pipelined  parallel  registers  in  the  data 
path.  The  scan  path  mode  is  for  initialization  and  signature  read-out.  Non-zero  seeds  are  shifted  in 
via  the  Scan_In  port  and  signatures  are  read  out  through  the  Scan_Out  port.  A  scan  path  can  be 
formed  through  the  pipe  to  read  out  signatures  in  the  intermediate  stages  as  well. 

For testing, the CBITs are configured in the MISR mode, in which they concurrently perform
pseudo-exhaustive test pattern generation for the succeeding CUT and output signature analysis for
the previous CUT. The combinations of the three control signals, $C_x$, $C_y$ and $C_z$, providing the three
major operations are summarized in Table 1:
Table 1: Control signals and the corresponding settings of the CBIT operation modes

Control Signals ($C_x$ $C_y$ $C_z$)    Configuration
(1 1 1)                          parallel register mode
(0 1 -)                          scan path mode
(1 - -)                          MISR mode
(1 0 1)                          most significant byte for cascading
(1 1 0)                          least significant byte for cascading
(1 0 0)                          single byte MISR

As shown by the last three rows in Table 1, the combinations of $C_y$ and $C_z$ enable the
cascading of the CBITs.

Pipelining for Self Testing

Constructing A Pipe with CUTs and CBITs

In addition to the horizontal extension of the CBITs to accommodate large I/O MCM
testing, further reduction in testing time can be accomplished when several functional blocks¹ in
an MCM form a pipe where blocks in the pipe can be tested concurrently. Several pipes can be
constructed according to their functionality and data widths. Each pipe consists of one zero-th stage
CBIT suite, and subsequent stages of block and CBIT set.

Functional  blocks  with  similar  number  of  inputs/outputs  are  clustered  to  form  a  pipe.  CBIT 
suites with corresponding width are then constructed to match the data width of each pipe. For
those CUTs with very limited outputs (e.g., encoders), it is possible that more CUTs can be
clustered and analyzed by the CBIT suites at each stage. Alternatively, several shorter or smaller
width  pipes  can  be  constructed  by  the  partition/segmentation  process  mentioned  in  [16].  Fig.  3a 
illustrates  how  pipes  for  a  data  path  in  the  SUT  are  constructed,  and  Fig.  3b  for  a  control  path  which 
usually  has  non-uniform  input/output  bit-width  or  branched  signal  flows.  The  proper  length  of  any 
given  pipe  is  determined  by  the  requirements  on  the  state  coverage,  the  fault  coverage,  and  the 
aliasing probability. Preliminary results can be found in our previous paper [2]. Once the set of
pipes is formed, the number of stages in each pipe may be rearranged such that most of the pipes
can finish self-testing simultaneously. Normally, existing data paths with pipelining form natural
self-testing pipes. When a pipe becomes so long that it needs to be decomposed into two shorter
ones, only the zero-th stage CBIT suite is added to the second pipe. All pipes are created under
this guideline after the rearrangement phase to give the maximum parallelism for this scheme.

For  high  fan-in  CUTs,  it  is  desirable  to  decompose  the  original  network  into  segments  with 
fewer fan-ins [16]. The controllability, detectability, and observability measures of a segmented
circuit are exactly the same as those of the unsegmented CUT but require less computation effort [17].
Segments can be grouped into clusters by adopting algorithms proposed in, e.g., [18], and
oftentimes clusters identify natural LTA pipes.

The  Loop  Testing  Architecture  (LTA) 

Because  of  the  degeneration  of  the  cumulative  test  results  over  multiple  stages,  there  exists 
the  need  for  higher  test  coverage  and  lower  aliasing  probability  in  the  pipelined  MISR  operation. 
One such improvement involves further cascading CBITs in neighboring stages using the
Feedback_In/Feedback_Out lines to increase the length of the CBIT suite. The scan lines can also
be  constructed  to  facilitate  scanning  out  signatures  from  all  CBITs  serially  after  the  test  session. 
This  is  referred  to  as  the  Loop  Testing  Architecture  (LTA). 

Fig.  3a  also  illustrates  the  construction  of  LTA  where  the  Feedback_In  is  selected  for  the 
CBIT  which  performs  the  most  significant  unit  analysis  and  Feedback_Out  is  selected  for  the  least 
significant  CBIT.  The  Scan_In  and  Scan_Out  ports  of  the  CBITs  at  each  stage  can  be  daisy-chained 
to  give  a  scan  path  for  the  initialization  and  scanning  out  the  final  signature  of  each  CBIT  (also 
shown  in  Fig.  5).  The  last  single  CBIT  suite  of  a  pipe  can  be  connected  to  the  zero-th  stage  CBIT. 
Thus  for  each  pipe,  we  have  double-length  CBIT  suites  for  signature  analysis  which  would  result 
in  smaller  aliasing  probability  for  the  whole  pipe.  Fig.  4  shows  the  equivalent  data  flow  when  the 
CBITs are paired to do a double-length signature analysis. Those grayed functional blocks (F1, F2,
F4, etc.) are replicated to show the paired testing flow when two CBITs are cascaded in the LTA.
Arbitrary lengths of neighboring CBITs can be created using LTA for a desirable aliasing probability.

1. We use "functional blocks" to refer to the CUTs in a SUT, and "modules" to refer to a CUT/SUT with BIT circuits.

Evaluation  of  the  Pipelining  Test 

For testing, all of the CBITs are configured as MISRs, and are used for test pattern
generation as well as signature analysis. This is justified by two assumptions [5]:

(i)  The  result  of  the  input  (seed)  and  the  current  state  under  the  operation  governed  by  the 
characteristic  polynomial  of  the  LFSR/MISR  shall  not  be  0  for  any  state  of  the  PRS. 

(ii)  Multiple  inputs/seeds  to  the  LFSR/MISR  still  traverse  through  all  the  states  of  the  PRS; 
the  degeneration/missing  of  some  states  in  the  PRS  because  of  some  special 
combinations/sequences  of  the  seeds  is  not  considered  in  current  discussion. 

These  are  further  proved  in  [13]  where  the  properties  pertaining  to  the  randomness  of  the 
patterns  generated  by  a  MISR  exist  even  if  the  inputs  are  non-equal-probable. 

To  justify  the  pipelined  LTA  approach,  we  need  to  prove  two  things  regarding  the  dual  use 
of  the  intermediate  CBITs.  First,  we  need  to  show  that  these  CBITs  are  effective  as  TPGs  where 
patterns generated are indeed maximum-length PRS at each stage. This is supported by the
pseudo-random property of the generating polynomial of the CBIT in the MISR mode, and is measured by
the  percentage  of  the  corresponding  maximum-length  PRS.  Second,  we  need  to  show  that  the 
limited  output  patterns  of  these  functional  blocks  do  not  disturb  the  randomness  of  the  signature, 
where  the  aliasing  probability  remains  acceptably  small  after  a  number  of  stages. 


Properties  of  the  Pseudo-Random  Test  Pattern  Generation  from  CBIT 

When  CBITs  are  constructed  by  LFSRs  with  primitive/irreducible  characteristic 
polynomials,  they  possess  the  following  major  properties  [5]: 

(i) Every element/state $\alpha$ in the PRS generated by the LFSR has a complementary
element/state $\bar{\alpha}$ in the same sequence such that $\alpha + \bar{\alpha} = 0$ (N-bit wide 0's), where '+'
represents the operation on the two's complementary elements of the PRS defined by
the characteristic polynomial of the LFSR.

(ii) For the cyclic PRS, more than one input seed will either decompose the original
maximum-length cycle into more than one sub-cycle or merge at least two sub-cycles
together.

(iii) The total number of (distinct) states of all the sub-cycles (if decomposed by multiple
seeds) is $(2^N - 1)$ for an N-stage LFSR. In this manner, the 0 state is excluded from the
PRS and forms a trivial cycle (0 → 0) for the LFSR.

When  CBITs  are  cascaded  to  generate  test  patterns  for  different  functional  blocks  in  a  pipe, 
we  have  the  following  observation: 

(i) Each CUT with input buses at most N bits wide needs only $(2^N - 1)$ different test
patterns to finish the exhaustive self-testing.

(ii) In order to test the paired CUTs, each with no more than N inputs, exhaustively, at least
$(2^N - 1)$ but no more than $(2^{2N} - 1)$ test patterns should be generated by one pair of
cascaded CBITs. For example, given an eight-input CUT, we need $(2^8 - 1)$ test
patterns. If there is no correlation between the two neighboring CUTs for one pair of
cascaded CBIT suites, we need one maximum-length cycle of the 8-bit wide PRS to
have one 8-bit CUT fully tested. However, because of the equally distributed 1's in
the 16-bit wide PRS [5], two 8-bit CUTs should be fully tested before the extended
PRS reaches its maximum-length period (which is $(2^{16} - 1)$). The actual number of
test patterns needed to fully test m CUTs simultaneously using m cascaded CBITs
depends on the characteristic polynomial of the extended CBITs and the input seeds.
But in general, we have the following relation:

$$(2^N - 1) \le L \le (2^{mN} - 1) \qquad (1)$$

where L is the test length needed to test m CUTs exhaustively given m CBIT suites
cascaded for single stage analysis in a pipe.

Therefore, CBITs are effective as TPGs when the test length is appropriately chosen
according to Eq. (1).

Aliasing Probability for Single Stage MISRs

A CBIT in the MISR mode is a special case of the generalized linear-feedback shift register
(GLFSR) [1]. A GLFSR(m, N) is a generalized m-stage, N-input signature analyzer with linear
feedback patterns built over the Galois field $GF(2^N)$. An N-input MISR is the case GLFSR(m = 1,
N). Therefore, when the characteristic polynomials for the CBITs are designed to be prime, not
only are the test patterns maximum-length, but quick convergence to the asymptotic aliasing
probability can also be guaranteed.

Theorem 2 in [11] provides a general formula for calculating the aliasing probability for a
one-stage, N-bit wide MISR:

$$P_{al}(p_0, p_1, p_2, \ldots, p_{2^N - 1}) \qquad (2)$$

where L is the test length and $(p_0, p_1, p_2, \ldots, p_{2^N - 1})$ are the Walsh transforms of the error
probabilities from an N-bit output CUT. When $p_i = 0$, there is no error detection, and when $p_i = 1$, an
error will always be detected.
Some closed forms of $P_{al}$ exist with additional conditions [1]. Here we present two of the
closed forms, choosing the bit-error transition probability, p, to be 0.5, meaning that the
probability of an output bit being erroneous is 0.5:

(i) When the test length L is $m(2^N - 1)$, where $m \ge 1$ is an integer, the aliasing probability
$P_{al}$ is

$$P_{al}(L) = 2^{-N} \qquad (3)$$

for the independent error model [1].

(ii) When the number of test patterns is an arbitrary positive integer L and the probability
of an output bit being wrong is 0.5, then $P_{al}$ is

[Equation (4): closed-form expression for $P_{al}(L)$, see [1]]

for the $2^N$-ary symmetric-channel error model [1].

Notice that for both cases, when $2^N$ is much greater than one, $P_{al}$ converges to an asymptotic
value, $2^{-N}$. Also, when the test length L is less than one maximum length for the N-bit wide MISR
(i.e., $2^N - 1$) for the independent error model, Eq. (2) should be used for calculating the exact
aliasing probability of the MISR.
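
The $2^{-N}$ asymptote can be checked by brute force on a toy MISR. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses a 3-bit MISR with the assumed primitive polynomial $x^3 + x + 1$, treats every N-bit error word as equally likely (bit-error probability 0.5), and relies on the linearity of the MISR, so that a faulty response aliases exactly when the error sequence alone compacts to the all-zero signature.

```python
from itertools import product


def misr_signature(words, poly: int, width: int) -> int:
    """Compact a sequence of `width`-bit words in a MISR that starts at 0."""
    state = 0
    for word in words:
        state <<= 1                # shift the register
        if state >> width:
            state ^= poly          # feedback through the XOR taps
        state ^= word              # XOR the parallel inputs into the stages
    return state


N, L, POLY = 3, 4, 0b1011          # toy sizes; x^3 + x + 1 is an assumed choice

# Enumerate every non-zero error sequence and count those that alias.
aliased = sum(
    1
    for err in product(range(2 ** N), repeat=L)
    if any(err) and misr_signature(err, POLY, N) == 0
)
total = 2 ** (N * L)
print(aliased, total, aliased / total)
```

Exactly $2^{N(L-1)} - 1 = 511$ of the 4096 equally likely error sequences alias, that is $2^{-N} - 2^{-NL}$, which is indistinguishable from $2^{-N}$ once L is more than a few patterns. Whether the small correction term appears depends on whether the error-free sequence is counted.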

Aliasing  Probability  in  the  Pipelining  Scheme 

In the multi-stage pipelining MCM testing scheme, the aliasing probability for the k-th
stage can be calculated as

$$
\begin{aligned}
P_k &= 1 - (\text{non-aliasing probability over } k \text{ stages}) \\
    &= 1 - (1 - P_{al})^k \\
    &= 1 - \left(1 - k P_{al} + \frac{k(k-1)}{2!} P_{al}^2 - \cdots + (-1)^k P_{al}^k\right) \\
    &= \sum_{i=1}^{k} (-1)^{i-1} \frac{k!}{i!\,(k-i)!} P_{al}^i .
\end{aligned}
$$

Let $2^N$ be much greater than one, which is generally true for all CBIT suites; then $P_{al}$ of the N-input
MISRs can be approximated by $2^{-N}$ for all values of m or L in the above formulae (3) and (4).
Then $P_k$ simplifies to

$$P_k = \sum_{i=1}^{k} (-1)^{i-1} \frac{k!}{i!\,(k-i)!}\, 2^{-iN}, \qquad \text{when } 2^N \gg 1,$$

for pipes constructed by N-input MISRs. By ignoring the contribution from the higher power terms
smaller than $2^{-N}$, the aliasing probability for the k-th stage pipelined CBIT converges to

$$P_k \approx k \cdot 2^{-N}. \qquad (5)$$

Eq. (5) gives the asymptotic value for both the symmetric-channel error model with any test
length and the independent error model with a test length of at least one maximum length. When the
number of stages, k, is much smaller than the maximum length of the PRS generated by the MISR
(i.e., $2^N - 1$), the aliasing probability at the k-th stage MISR in the pipelining scheme is of the same
order of magnitude, $O(2^{-N})$. This is also validated by the previous simulation result in [2], where
the aliasing frequency/probability stays constant at $O(2^{-N})$ over 6 stages in the pipelining path.

Thus,  the  randomness  of  the  signature  is  preserved  in  the  case  of  limited  number  of  multiple 
inputs,  and  the  aliasing  probability  is  sufficiently  small  given  that  k,  the  number  of  stages,  is  small 
compared  with  the  maximum  length  of  the  PRS. 
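
To give a feel for the numbers, the following Python sketch compares the k-stage aliasing probability $1 - (1 - P_{al})^k$ with the first-order value $k \cdot 2^{-N}$, taking $P_{al} = 2^{-N}$ as in the text. The function name and the chosen values (a 16-bit suite and a 6-stage pipe, matching the sizes mentioned in this paper) are only for illustration.

```python
def stage_aliasing(n_bits: int, k_stages: int) -> tuple[float, float]:
    """Return (exact, first_order) aliasing probability at the k-th stage.

    exact       = 1 - (1 - 2^-N)^k   (all binomial terms kept)
    first_order = k * 2^-N           (higher powers of 2^-N dropped)
    """
    p_al = 2.0 ** -n_bits
    exact = 1.0 - (1.0 - p_al) ** k_stages
    first_order = k_stages * p_al
    return exact, first_order


exact, approx = stage_aliasing(n_bits=16, k_stages=6)
print(f"exact       = {exact:.6e}")
print(f"first order = {approx:.6e}")
print(f"relative gap = {(approx - exact) / approx:.2e}")
```

For these sizes the two values differ only by a relative gap on the order of $2^{-N}$ itself, which is why the aliasing probability stays at $O(2^{-N})$ as long as k is small.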

Other  LTA  Applications 

Capability  for  Testing  the  Interconnects 

The interconnects among MCMs can be tested with the pipelining scheme by integrating
two sets of CBITs next to the I/O pins in each module. The first CBIT set operates in the MISR
mode both for input validation, before the signals reach the internal logic, and as the TPG for the
internal logic blocks. The second set of CBITs operates in the MISR mode both to analyze the outputs
of the internal logic circuitry and as the TPG for the interconnection to the next module. The
interconnects among the MCMs are viewed as simplified CUTs with compatible data paths. The whole system can be
regarded as many CUTs (including the interconnects) to be tested under the LTA scheme.

Fig. 5a shows one CBIT suite placed at the primary outputs (POs) of each CUT. The zero-th
stage CBIT suite is added in order to generate the pseudo-random test patterns for the 1st CUT.

Neighboring CBITs can be cascaded as shown previously in Fig. 3a to test the modular
functionality of each CUT. However, this implementation cannot test the interconnections among
the CUTs. In Fig. 5b, an extra CBIT suite is inserted near the primary inputs (PIs) of each CUT.
Therefore, we always have two CBIT suites testing either a functional block or an interconnect
pattern between two CUTs. This implementation provides a general approach which can test any
fault patterns in all permutations, e.g., multiple stuck-at faults, bridging or coupling, pattern
sensitive faults, etc. The reason is that the N-bit wide interconnect network often realizes fewer
than $(2^N - 1)$ different patterns for implementing signal links between any two CUTs, whereas
our N-bit wide CBIT suite can generate $(2^N - 1)$ different test patterns to exercise the N-bit wide
interconnect exhaustively.


In scheduling the testing for the interconnects, no extra modes are needed and no timing
conflicts exist. This is because we use two sets of CBITs near the I/O pins, which transform
the interconnects into another type of CUT directly. Both the modules and interconnects can be
tested concurrently in this pipelining scheme.

Area overhead resulting from adopting the LTA for interconnection testing is caused by the
insertion of one extra CBIT suite near the input ports, which makes the interconnects observable,
whereas only one CBIT suite at the outputs of each module is needed for performing the pipelining
testing for the modules. Comparing the two schemes in Fig. 5, testing the interconnects and
module functionality concurrently takes one more CBIT suite but saves separate mode(s) for
reconfiguring the SUT to test the interconnects. Therefore, with a small area penalty, we can save
considerable testing time by testing the modules and interconnects simultaneously in one mode.
Furthermore, the placement of CBIT sets still applies when the I/O ports are moved to the center
of the modules.

Fault Location with Extra Observability of Intermediate CBITs

Signatures of the CUTs are read out when the CBITs are configured in the scan path mode.
Usually, a wrong final signature indicates that faults exist in the test pipe. However, it is
possible that faults from different CUTs in a pipe cancel each other, resulting in a good
signature at the last stage. Therefore, it is important that we know the exact test length applied to each
CBIT suite so that signatures of the intermediate stages can be made observable, thus facilitating
fault location. In this manner, we can perform better diagnosis by locating faults in specific
CUTs.

Examples 

We  develop  two  experiments  to  demonstrate  the  effectiveness  of  the  proposed  Loop 
Testing  Architecture.  The  first  experiment  involves  testing  a  homogeneous  processor  environment 
consisting of SN74LS181 ALUs. The second experiment involves a heterogeneous MCM system
with several types of components. Both of these systems are transformed into test pipes. Results in
test  coverage  and  aliasing  probability  are  presented  for  discussion. 

Six-stage  ALU  Pipes 

Test  Setup 

Six ALUs form a pipe with 16-bit CBIT suites inserted between the ALUs. The 16-bit
CBIT suite is chosen as the TPG for the 14-bit input ALU, the SN74LS181. The 8-bit output of the
SN74LS181 is fed into another CBIT suite configured for signature analysis. In this experiment,
we  developed  two  pipes  based  on  the  Loop  Testing  Architecture:  one  implements  primitive 
characteristic  polynomial  between  the  looped  CBIT  pairs  and  the  other  directly  connects  the 
feedback  lines  without  changing  the  feedback  pattern  of  each  CBIT  suite.  We  also  reconstructed 
the  straight  pipe  from  [2]  for  baseline  comparison. 


Randomness  of  the  TPG 

We measure the randomness of the test pattern generation process at each stage of the three
pipes for various test lengths. The purpose is to evaluate the effectiveness of the CBITs as test
pattern generators when operating in the MISR mode, as well as the impact of pipe length on the test
pattern generation. For an N-bit wide CBIT suite, the randomness measure is 100% if $2^N$ test
patterns are generated. Fig. 6 shows the randomness measure for each stage of the three different
pipes. In all three configurations, the randomness levels off after the first stage, indicating that the
length of the pipe does not affect the random pattern generation process.
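
The randomness measure described above can be sketched in Python as the percentage of the $2^N$ possible states that a MISR visits within a given test length. This is a simplified stand-in for the experiment: the CUT outputs feeding the MISR are replaced by a seeded pseudo-random source, the register is 8 bits wide to keep the run short, and the polynomial is an assumed primitive one, so the numbers it prints are not the results of Fig. 6.

```python
import random


def randomness_measure(width: int, poly: int, test_length: int,
                       rng_seed: int = 0) -> float:
    """Percentage of the 2^width states visited by a MISR in test_length clocks."""
    rng = random.Random(rng_seed)          # stand-in for the preceding CUT's outputs
    state = 1                              # non-zero seed, as shifted in by scan
    seen = {state}
    for _ in range(test_length):
        state <<= 1                        # shift the register
        if state >> width:
            state ^= poly                  # feedback through the XOR taps
        state ^= rng.randrange(1 << width) # parallel inputs from the CUT
        seen.add(state)
    return 100.0 * len(seen) / (1 << width)


POLY8 = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed primitive polynomial)
for length in (64, 256, 1024, 4096):
    print(length, f"{randomness_measure(8, POLY8, length):.1f}%")
```

Because the same input stream is reused for every test length, the measure can only grow or stay level as the test length increases, which mirrors the levelling-off behavior the text describes.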

Also, in our previous observation, the required test length, L, for the two N-bit CUTs under
LTA testing (in this case, m = 2 for Eq. (1)) should be smaller than $(2^{2N} - 1)$. This is validated by
the simulation result in Fig. 6: all the ALUs in every cascading stage of the pipe can be
exhaustively tested when L is about four times the maximum length of the N-bit wide CBIT suite.
That is, instead of $(2^{2N} - 1)$ patterns for the two ALUs, only about $4 \times 2^N$ test patterns are needed
to give 100% randomness for the two ALUs at each cascading stage.

The cascaded CBIT suite implementing the LTA outperforms the straight pipe in producing
random patterns, and the LTA with the primitive polynomial is better than the LTA implementing
the non-primitive polynomial. As we increase the test length, the CBIT suites eventually generate
a 16-bit wide maximum-length PRS. This validates our earlier presumption that PRS generators with
multiple inputs still produce the maximum-length PRS [5]. Regardless of how the 8-bit wide
outputs are connected to the CBIT suite (the higher, lower, or even the middle byte), after a
sufficiently long test length, e.g., four times the maximum length, 100% randomness of the 16-bit
wide PRS can still be reached.
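The difference between primitive and non-primitive feedback can be checked on an autonomous LFSR (no parallel inputs). The sketch below is a minimal illustration, not the CBIT circuit: the tap sets (16, 14, 13, 11) and (8, 6, 5, 4) are standard textbook maximum-length choices assumed here because this section does not state the polynomials actually used, and (8, 4) corresponds to x^8 + x^4 + 1, which factors and is therefore not primitive.

```python
def lfsr_period(n_bits, taps, seed=1):
    """Count clocks until a right-shifting Fibonacci LFSR returns to its seed state.

    taps must include n_bits so that the update is invertible and the seed recurs.
    """
    if n_bits not in taps:
        raise ValueError("taps must include n_bits")
    state, steps = seed, 0
    while True:
        fb = 0
        for t in taps:                   # tap t reads bit (n_bits - t)
            fb ^= (state >> (n_bits - t)) & 1
        state = (state >> 1) | (fb << (n_bits - 1))
        steps += 1
        if state == seed:
            return steps

if __name__ == "__main__":
    print("16-bit, assumed primitive taps:", lfsr_period(16, (16, 14, 13, 11)))
    print(" 8-bit, assumed primitive taps:", lfsr_period(8, (8, 6, 5, 4)))
    print(" 8-bit, non-primitive taps    :", lfsr_period(8, (8, 4)))
```

A primitive polynomial of degree n yields the full period of 2^n - 1 states; the non-primitive case closes its cycle much earlier, which is why parallel inputs and longer test lengths matter more for that configuration.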

Aliasing  Probability  of  Signature  Analysis 

A single stuck-at-0 fault is injected at one output bit of the 1st stage ALU in all pipes as a way
to introduce faults. The signatures of each ALU, collected after a certain number of test patterns
have been applied to the pipes, are compared with known good signatures. Aliasing occurs when the
faulty pipe produces the same signature as the fault-free pipe.
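The aliasing check described above can be sketched with a single 8-bit MISR. Everything specific here is an assumption for illustration: the taps (8, 6, 5, 4), the pseudo-random response words standing in for ALU outputs, and the helper names are hypothetical, and one isolated MISR is much simpler than the multi-stage pipes measured in this paper.

```python
import random

MASK = 0xFF

def misr_step(state, data_in):
    """8-bit right-shifting MISR, assumed taps (8, 6, 5, 4); input is XORed in after the shift."""
    fb = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
    return (((state >> 1) | (fb << 7)) ^ data_in) & MASK

def signature(words, init=0):
    """Compress a response stream into one 8-bit signature."""
    state = init
    for w in words:
        state = misr_step(state, w)
    return state

def stuck_at_0(words, bit=0):
    """Force one output bit to 0 in every response word."""
    return [w & ~(1 << bit) & MASK for w in words]

def aliasing_frequency(test_length, trials=2000, seed=7):
    """Share of faulty runs (fault excited at least once) whose signature matches the good one."""
    rng = random.Random(seed)
    aliased = excited = 0
    for _ in range(trials):
        good = [rng.getrandbits(8) for _ in range(test_length)]
        bad = stuck_at_0(good)
        if bad == good:                  # fault never excited: not an aliasing case
            continue
        excited += 1
        aliased += signature(bad) == signature(good)
    return aliased / excited if excited else 0.0

if __name__ == "__main__":
    for length in (4, 16, 64, 256):
        print(length, aliasing_frequency(length))
```

Since the MISR is linear, aliasing happens exactly when the error stream (good XOR faulty) compresses to the all-zero signature, so a single corrupted word can never alias in this model.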

We compare this aliasing probability at the last stage of the three pipes resulting from a
stuck-at-0 fault at the least significant bit of the output of the 1st stage ALU. As shown in Fig. 7a,
the aliasing probability of a 16-bit CBIT suite approaches the O(2^{-16}) limit as the test length
increases in all three pipes. The straight pipe tends to alias early, at shorter test lengths, while the
CBIT suite implementing the LTA does not exhibit aliasing until after a sufficiently long test (in this
case, the primitive LTA pipe shows aliasing at the 6-th stage after 100 tests). However,
when the test length is smaller than the length of one maximum-length PRS, the aliasing
probability is more pronounced. Fig. 7b shows that aliasing occurs at the later stages before it
reaches the first stage. For the first stage, aliasing occurs only after more than 1000 tests are
applied, whereas the 6-th stage shows aliasing at test lengths of less than 10. This implies that the
pipelined LTA needs a ‘warm-up’ period in order to reduce the aliasing probability at each stage
for small test lengths.

In general, the aliasing probabilities in the CBIT suites implementing the LTA are smaller
than those of the straight pipe CBIT suites. When only one CBIT suite in the last stage is used for
comparison, all stay at O(2^{-16}) (Fig. 7c). If the contents of the two CBIT suites are read as the
complete signature, the aliasing probabilities for both pipes implementing the LTA are O(2^{-32}) in
our single stuck fault simulation, which is negligible compared to O(2^{-16}). This is due to the
extended width of the CBIT suite of 2N, i.e., 2 x 16. It is interesting to see in Fig. 7a and Fig. 7c that
the cascaded CBITs with non-primitive characteristic polynomials give the same asymptotic value
of the aliasing probability as that of the primitive feedback polynomials discussed in [4].
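For scale, the two asymptotic orders quoted for the 16-bit and the extended 32-bit signatures work out as follows. The symbols P_16 and P_32 are introduced here only for this illustration.

```latex
P_{16} \approx 2^{-16} = \frac{1}{65536} \approx 1.53 \times 10^{-5},
\qquad
P_{32} \approx 2^{-32} = \frac{1}{4294967296} \approx 2.33 \times 10^{-10},
\qquad
\frac{P_{32}}{P_{16}} = 2^{-16}
```

Reading both CBIT suites as one signature therefore lowers the asymptotic aliasing probability by a further factor of 65536.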

Area Overhead and Testing Time

The area overhead for implementing the LTA consists of the extra wiring to cascade the
CBIT suites and the additional XORs implementing the primitive generating polynomial. As
mentioned before, no extra circuitry is needed compared with boundary scan when we
construct the scan path with the cascaded CBITs. The testing time for the LTA pipes is the same
as that of the straight pipe. However, with a little more wiring, the LTA pipes provide extra
observability at each stage in the pipe and a much lower aliasing probability with the extended
signatures.

Pipes with ALU, Caches and RAM (P pipe)

Test  Setup 

In the second experiment, an MCM consisting of one SN74LS181 ALU, one 8-bit RAM,
and two 16x8 data caches is put in a four-stage testing pipe. One 16-bit CBIT suite is placed at
the inputs of the ALU to perform TPG. Four 8-bit CBIT cells are inserted between the CUTs. The
test patterns from the 8-bit CBIT connected to the inputs of the RAM are de-MUXed to test both
Address and Data inputs exhaustively. This is referred to as the straight P pipe [3]. A similar LTA
pipe, also referred to as the ‘cascaded P pipe’, is constructed with extra connections between
neighboring CBITs such that paired 8-bit wide CBITs can perform 16-bit signature analysis for
two CUTs simultaneously (Fig. 4).

Randomness  of  the  TPG 

Fig. 8 shows the randomness measure of each stage with different test lengths for the
straight P pipe. In Fig. 8a, 100% randomness is reached in the latter stage CBITs when the input
test length is greater than 2^8 for the 8-bit analysis. Also, in Fig. 8b, the zero-th stage CBIT gives
100% randomness after the input test length is greater than four times the maximum length for
the 14-bit wide input bus to the ALU. This is consistent with the previous experiment, in that
‘warming up’ the zero-th stage CBIT can improve the quality of the TPG. For the P pipe with
cascaded CBITs, Fig. 9 shows behavior very similar to that of the straight P pipe. This supports
the earlier assumption that a maximum-length PRS will be generated when the test length is at
least four times the maximum length for a given data width of the CBIT suites.

In Fig. 8a and Fig. 9a, only four times the maximum-length PRS generated by one CBIT
suite is needed to test those neighboring CUTs in the LTA (i.e., 4 x 2^8), as opposed to that from
a double-width CBIT suite (i.e., 2^16). Fig. 9 differs from Fig. 8 when the P pipe uses cascaded
CBITs: the convergence rate for the cascaded CBITs is faster than that of the straight
pipe. This is again similar to the results of the previous ALU pipes.
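As a worked comparison of the two test lengths quoted above for the 8-bit CBIT cells (the symbols L_LTA and L_double are labels introduced only for this illustration):

```latex
L_{\mathrm{LTA}} = 4 \times 2^{8} = 1024,
\qquad
L_{\mathrm{double}} = 2^{16} = 65536,
\qquad
\frac{L_{\mathrm{double}}}{L_{\mathrm{LTA}}} = \frac{65536}{1024} = 64
```

Under these figures, the paired 8-bit CBITs need 64 times fewer patterns than a single double-width suite cycled through its full sequence.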

Aliasing  Probability  of  Signature  Analysis 

The single stuck-at-0 fault is injected at the least significant bit of the output of the ALU.
Signatures from the faulty pipe are compared with those from a fault-free pipe to calculate the
aliasing probability. The aliasing frequencies per injected fault of the final signatures in the last
stage CBITs of the two P pipes are shown in Fig. 10a. The signatures are read byte-wise from each
CBIT cell for the aliasing frequency calculation. Complete signatures for the extended CBIT pairs
are also taken for analysis. All aliasing frequencies per injected fault for the 8-bit wide signatures
converge to the asymptotic value, which is 2^{-8} for the 8-bit case. However, the P pipe with
cascaded CBITs gives a smaller byte-wise aliasing frequency than the straight P pipe when the test
length is smaller than one cycle of the maximum-length PRS, i.e., 2^8. Aliasing for the extended
signatures does not occur until the test length reaches one maximum length for the 16-bit signature
analysis, i.e., 2^16 or 65536.

Fig. 10b shows the aliasing frequencies for the intermediate stages in the P pipe with
cascaded CBITs. As in the ALU pipes, the last stage still gives the worst analysis result.
Therefore, ‘warming up’ the P pipes seems to improve the test quality. In addition, the aliasing
from extended signature analysis (in this case, O(2^{-16}) ≈ 1.52 x 10^{-5} for the complete 16-bit
wide CBIT suite) is plotted for comparison.

In the above experiments, the randomness of the TPG and the aliasing problem for multiple
stage PSAs are analyzed. As the input test length traverses the whole cycle of the
maximum-length PRS provided by the MISRs, all states in the PRS will be visited at least once.
Thus we have 100% randomness of the maximum-length PRS during the TPG process. Our
experimental results show this especially for the downstream stages in a pipe. For a
well-partitioned pipelined testing path, the aliasing probability will be of the same order as that
for the serial signature analyzer (SSA).

Area  Overhead  and  Testing  Time 

To implement the LTA in the cascaded P pipe, only the wiring connecting the CBIT
suites is needed beyond that of the straight P pipe. In order to have a primitive polynomial for the
extended CBIT suite, spare XOR gates are provided for the most/least significant suite configuration.
In terms of the time needed in the testing phase, again, we only need two modes: initializing the
CBITs and MISR mode analysis.

Performance  Comparison  with  Other  Approaches 

LTA with CBITs is compared with other testing approaches in two aspects: testing time and
area overhead. These approaches include boundary scan (the JTAG standard) and pipelined
BIST with conflict scheduling. Testing time is calculated by adding the set-up time (T_set-up), the
module testing time (T_module), and the read-out time (T_read-out), normalized to the average
testing time per module. Area overhead is calculated by adding the necessary hardware components
plus 40-80% extra for wiring.

The LTA does not need different CBIT cells in the design library to test different widths of
the data paths. For a wider data bus, we can cascade the CBITs to get an extended PSA without
downgrading the quality of the signature analysis. Thus the hardware penalty required by different
sizes of BILBOs can be eliminated. (The only hardware overhead needed by the LTA is the
zero-th stage CBIT for a new pipe, since the wiring for the cascaded case is negligible.)

According to previous work on the performance analysis of CBITs vs. boundary scan [3],
CBIT requires less than 10% of the testing time while requiring less than twice the
area of boundary-scan designs. In both cases, the fault coverage is 100%. In our previous
examples, it was shown that in the 6 stages of 16-bit pipelined CBIT suites, the aliasing frequency/
probability stays as low as O(2^{-16}) for a sufficiently long test length. Therefore, with a limited
area penalty but an order of magnitude improvement in the total testing time, the LTA can drastically
reduce the cost of MCM testing in today's competitive market.


Comparing  with  Boundary  Scan 

The boundary scan approach needs two separate modes and carefully selected test patterns
to test the processor for interconnect failures [8]. As the bit-width of the communicating
data path increases, boundary scan requires more complicated test patterns and more test cycles.

Since the original test vectors may not be available for boundary scan test in MCMs
using ATPG [15], for an N-input CUT, the number of test patterns, L, needed for pseudo-random
testing with boundary scan is O(C x 2^N), where C ≥ 1 is a constant given by a statistical estimation
on a specific test pattern generation technique [12]. The optimized value for C is one. However, to
have certain test confidence that the most-difficult-to-detect faults are covered, a larger C is
required [12]. The total time needed for one N-input CUT under boundary scan testing is

T_{\text{set-up}} + T_{\text{module}} + T_{\text{read-out}} = (C \times 2^{N}) \times (t_{\text{set-up}} + t_{\text{module}} + t_{\text{read-out}}) \qquad (6)

where t_set-up, t_module, and t_read-out are the scan-in, one-execution, and scan-out times for one CUT.

But LTA gives

T_{\text{set-up}} + T_{\text{module}} + T_{\text{read-out}} = t_{\text{set-up}} + \frac{4 \times 2^{N}}{k} \times t_{\text{module}} + t_{\text{read-out}} \qquad (7)

where 4 x 2^N is the maximum given by Eq. (1) for m = 2 and k is the total number of stages of a
pipe implementing LTA. Thus, LTA saves the scan-in/scan-out time and the time for
pseudo-exhaustive testing per module/CUT, especially for 4 ≪ k ≪ 2^N. For example, if
t_module = t_0 clock cycles (t_0 is an equivalent number), t_set-up = 16 clock cycles, and
t_read-out = 16 clock cycles for a 16-bit input, 16-bit output CUT, boundary scan will need
2097152C + 65536 t_0 C clock cycles to finish the pseudo-exhaustive testing, and LTA will take
32 + 32768 t_0 clock cycles when the number of stages, k, is 8.
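The worked figures in this example (2097152C + 65536 t_0 C for boundary scan, 32 + 32768 t_0 for an 8-stage LTA) can be reproduced with a short calculation. This is a sketch under stated assumptions: the scan-in and scan-out times are taken as 16 clock cycles each, and the LTA expression (one set-up, one read-out, and 4 x 2^N patterns shared over k stages) is a form inferred from the quoted totals rather than a verified transcription. The function names are hypothetical.

```python
def boundary_scan_cycles(n_bits, c, t_setup, t_module, t_readout):
    """(C x 2^N) x (t_set-up + t_module + t_read-out): each pattern pays scan-in and scan-out."""
    return c * (2 ** n_bits) * (t_setup + t_module + t_readout)

def lta_cycles(n_bits, k, t_setup, t_module, t_readout):
    """ASSUMED form: one set-up and one read-out, plus 4 x 2^N patterns shared over k stages.

    Integer division is used, so k should divide 4 * 2**n_bits for an exact result.
    """
    return t_setup + (4 * 2 ** n_bits // k) * t_module + t_readout

if __name__ == "__main__":
    for t0 in (1, 2, 4):                 # t0: module execution time in clock cycles
        bs = boundary_scan_cycles(16, 1, 16, t0, 16)
        lta = lta_cycles(16, 8, 16, t0, 16)
        print(t0, bs, lta, bs / lta)
```

With C = 1 and t_0 = 1 the two totals are 2162688 and 32800 clock cycles, so most of the boundary scan cost in this model comes from repeating the scan-in and scan-out for every pattern.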

Considering interconnect testing, LTA does not require extra testing time in a separate
mode as boundary scan [8] does. So, the total testing time per module and its interconnect will
still be of O(t_set-up + 2^N x (t_module + t_interconnect) + t_read-out), since LTA can test both
the interconnect and the processor logic at the same time. Here t_interconnect denotes the
time for signals to transfer through the interconnection network, although its value is negligible
compared to the other t's in the formula (for interconnects with long propagation delay,
t_interconnect cannot be ignored). For boundary scan, the total testing time per CUT with its
interconnects will be on the order of

(C \times 2^{N}) \times (t_{\text{set-up-all}} + t_{\text{module}} + t_{\text{interconnect}} + t_{\text{read-out-all}})

where t_set-up-all and t_read-out-all are the sum of the scan-in times and the sum of the scan-out
times needed for the CUT and its interconnect. The efficiency of LTA during interconnect testing is
again shown by its saving more time than boundary scan.

Considering the area overhead, both the LTA and the boundary scan approaches need five
extra electrical pads for one processing element [8]. However, four I/O pins (test-mode-select,
test-reset, scan-in, and scan-out) are necessary for the control logic for boundary scan [8].

In  general,  there  is  more  area  consumed  by  LTA  with  the  XOR  gates  to  implement  the 
extended  generating  polynomial.  Our  previous  experiment  in  testing  the  processor  MCM  in  [3] 
shows  LTA  takes  4240  transistors  in  the  four-stage  straight  P  pipe  case  and  boundary  scan  takes 
3040  transistors  in  total.  Therefore,  LTA  consumes  about  39%  more  area  than  the  boundary  scan 
in  this  example.  However,  with  limited  area  overhead,  LTA  provides  a  better  BIT  implementation 
with  better  state  coverage  and  exponentially  lower  aliasing  probability.  Furthermore,  LTA 
provides  not  only  the  extensibility  for  the  PSAs  in  terms  of  the  CBIT  implementation,  but  also  the 
pipelining  for  several  CUTs  to  be  tested  concurrently  to  save  the  testing  time  per  CUT. 
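The area figure quoted above follows directly from the two transistor counts:

```latex
\frac{4240 - 3040}{3040} = \frac{1200}{3040} \approx 0.395
```

That is, roughly 39% more transistors for the LTA pipe than for boundary scan in this example.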


Comparing  with  Pipelined  BIST  with  Other  Conflict  Scheduling 

The other pipelined BIST approaches in [7] alternate separate modes of TPG and PSA in one
LFSR circuit. The implementation in [6] gives one-stage analysis for all the CUTs in a pipelined data
path by centralizing the TPG and distributing the PSA; i.e., one set of BILBOs serves as TPGs for all
pipes with a one-stage BILBO-CUTs-BILBO structure, with separate outputs to several PSAs. Conflict
tables have to be established to make sure the LFSRs perform TPG and PSA separately in
different testing schedules. Here we denote these as pipelined testing with conflict scheduling.
The total averaged testing time for one N-input CUT (‘kernel’ from [7]) can be calculated as

T_{\text{set-up}} + T_{\text{module}} + T_{\text{read-out}} = t_{\text{set-up}} + (t_{\text{module}} + (2^{N} - 1) \times D) + t_{\text{read-out}} \qquad (8)

where 2^N is used as the number of test patterns from [7] and D is the latency between the current
TPG and its immediate predecessor (TPG). The minimum/optimized value for D is one clock
cycle. Usually D is greater than one clock cycle, since a new test pattern cannot be generated until
the previous pattern has been generated and loaded into the CUT/kernel when the bus is available.

Comparing Eq. (7) and Eq. (8), even though an optimized pipelining schedule can be set for
the best value of D, conflict scheduling requires more testing time than LTA. This is because
our LTA approach utilizes the fundamental characteristics of the MISRs to operate simultaneously
as TPG and PSA. Thus LTA eliminates the waiting time for an available register and bus to apply a
separate test pattern to the CUT/kernel. In addition, the additional MISR circuits or
interconnects that conflict scheduling may require for separating the TPG and PSA modes are not
needed in the LTA. For a 16-input and 16-output CUT, the testing time needed from Eq. (8) is at
least 65567 + t_0 clock cycles (again, t_module = t_0 clock cycles), which is greater than the
32 + 32768 t_0 clock cycles given by the 8-stage LTA, since t_0 should be at most one clock cycle.
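The comparison between the two totals quoted above can be made explicit by solving for the module execution time t_0 at which they cross. This is only a rearrangement of the stated figures; no additional data is assumed.

```latex
65567 + t_0 > 32 + 32768\, t_0
\;\Longleftrightarrow\;
65535 > 32767\, t_0
\;\Longleftrightarrow\;
t_0 < \frac{65535}{32767} \approx 2.00003
```

So for t_0 of at most one clock cycle, as stated, the 8-stage LTA total is the smaller one; at t_0 = 1 the totals are 65568 and 32800 clock cycles respectively.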

Because  of  the  dual  TPG  and  PSA  mode  provided  by  the  LTA,  the  exhaustive  testing  time 
is  tremendously  reduced  by  cutting  the  time  for  scheduling  conflicts  on  one  MISR  (the  D  value  of 
Eq.  (8)).  In  addition,  the  extensibility  given  by  LTA  both  horizontally  for  bit-size  changes  and 
vertically  for  multiple  CUTs  to  be  tested  in  a  pipe  provides  best  utilization  of  the  parallel  testing. 
Pipelining  and  parallelism  can  be  performed  on  one  system  with  minimized  design  modification 
and  optimized  test  scheduling. 

Without losing the effectiveness of the test coverage, the LTA scheme requires less testing
time compared to the boundary scan and pipelining with conflict scheduling approaches. No
significant area overhead compared with these two implementations is anticipated in the LTA,
except for the spare XOR paths for cascadability. Furthermore, we suggest that by re-arranging the
placement of the CBIT circuits and the test scheduling, it is possible to gain high
testability/observability of the permanent faults both in the processor and in the interconnect.

Conclusion 

In  this  paper,  a  cascadable  Built-in  tester  (CBIT)  is  proposed  to  test  MCM  modules 
configured  in  a  pipelined  fashion.  CBITs  can  be  cascaded  to  match  the  data  width  of  the  CUTs  and 
have  been  shown  to  exhibit  high  test  coverage  with  100%  randomness  in  the  TPG  process  and  low 
aliasing  probability  in  signature  analysis.  CBIT  circuit  can  also  serve  as  a  switching  device  for 
module  reconfiguration.  This  is  achieved  by  the  extra  routing  resources  offered  to  the  packaged 
MCM  modules  by  current  design  houses.  These  routing  layers  can  be  used  to  reconfigure  the 
interconnection  when  a  die  is  diagnosed  faulty. 

We  also  introduced  the  Loop  Testing  Architecture  as  a  way  to  reduce  aliasing  probability. 
When  compared  to  the  GLFSR  approach  [1],  LTA  gives  similar  aliasing  probability  as  that  of  the 
two-fold  GLFSR.  LTA  implementation  can  also  be  applied  when  the  I/O  ports  are  moved  to  the 
center  of  the  chip  area  in  the  future  system  design. 

LTA is more efficient when the MCM testing sessions can be partitioned into several
sub-circuits for parallel testing. Partitioning algorithms using netlists as inputs can be found in [18],
while partitioning at a higher level is proposed in [16]. In-depth analysis of the testability issues for
partitioned CUTs is discussed in [17]. Part of our future work will emphasize integrating
partitioning/clustering algorithms with LTA, such that a hierarchical functional test methodology for
MCMs can be automated. Work on modifications to the LTA and CBIT circuitry to accommodate
interconnect reconfiguration and self-purging in the MCM is also expected to
improve fault tolerance.

Fig. 2(a) Parallel-in/parallel-out register/buffer mode

Fig. 2(b) Scan-in/scan-out shift register mode

Fig. 2(c) Parallel-in/parallel-out LFSR mode

Fig. 3 Examples for constructing LTA pipes with paired CBITs in the system under test (SUT)

Fig. 4 Data path of the m-stage pipelined extended CBIT testing

Fig. 6 Randomness measure of the CBITs as TPG over 6 stages in the ALU pipes (ML=2**^)

Fig. 6 Randomness measure of the CBITs as TPG over 6 stages in the ALU pipes (ML=2 °)

Fig. 7(a) Aliasing frequency at the 6-th stage for three ALU pipes

Fig. 7(b) Aliasing frequency for cascaded CBITs with primitive polynomial in the ALU pipe

Fig. 7(c) Aliasing frequency for test length=4ML over 6 stages of the ALU pipes

Fig. 10(a) Aliasing frequency at 4-th stage for two P pipes

Fig. 10(b) Aliasing frequency for cascaded CBITs with primitive polynomial in the P pipe


